LOAN DATA FROM PROSPER by Paul F. Seke

=======================================================================================================================


1. Background, study objectives, and variables

1.1. Background

Peer-to-peer (P2P) lending companies were created to serve the community by allowing people to lend money to one another without the high expenses, the cumbersome procedures and, allegedly, the tough requirements, high interest rates, and other complex mechanisms used by traditional financial institutions such as banks. We will analyze a dataset containing financial and personal information on people granted loans by the P2P lending company Prosper Marketplace, Inc. The information was mainly retrieved from the credit profile forms, but information on previous loans and risk score is also available. In addition, the lender yield on the loan, i.e. the net interest margin of a given investor on a given loan, is also available.

1.2. Study objectives

The main objective of the present study is to assess how the number of borrowers (people granted a loan) correlates with the Prosper risk score and with criteria considered by traditional financial institutions, such as borrower employment status duration, monthly income, evidence of stated income, and home ownership. We will also explore the impact of delinquency on the lender yield.

1.3. Variables considered of interest:

  • the number of inquiries (“TotalInquiries” and “InquiriesLast6Months”)
  • the borrower’s income and employment status duration (“StatedMonthlyIncome” and “EmploymentStatusDuration”)
  • the risk score and its variation compared to any prior loans, when applicable (“ProsperScore” and “ScorexChangeAtTimeOfListing”)
  • the number of days delinquent (“LoanCurrentDaysDelinquent”)
  • the lender yield on the loan (“LenderYield”)
  • the loan term, amount applied for, and the percentage funded (“Term”, “LoanOriginalAmount”, and “PercentFunded”)
  • the category of the listing that the borrower selected when posting their listing (“ListingCategory”, categorical)
  • the income range of the borrower at the time the listing was created (“IncomeRange”, categorical)
  • whether the borrower had the required documentation to support their income (“IncomeVerifiable”, boolean)
  • the employment status of the borrower at the time they posted the listing (“EmploymentStatus”, categorical)
  • the debt to income ratio of the borrower at the time the credit profile was pulled (“DebtToIncomeRatio”, continuous)
  • the current status of the loan (“LoanStatus”, categorical)
  • whether the borrower is a home owner (“IsBorrowerHomeowner”, boolean), i.e. whether the borrower has a mortgage on their credit profile or provides documentation confirming they are a home owner
  • and the date the credit profile was pulled (“DateCreditPulled”, factor).

Univariate Plots Section 1

2. Number of previous inquiries

2.1. Distributions of total and last 6 months inquiries

Let’s explore the total number of inquiries at the time the credit profile was pulled and the number of inquiries in the past six months (from the time the credit profile was pulled).

Histogram of the distribution of total inquiries. The histogram revealed a long-tailed distribution of a discrete variable: 7.47% of borrowers filled out a credit profile at Prosper for the first time, 48.55% for the 2nd to the 5th time, 28.68% for the 6th to the 10th time, and 15.29% for more than the 10th time. The graph also revealed that a few people filled out credit profiles more than 30 times, suggesting that these loans may have been for business.

Line graph of the ordered numbers of inquiries in the last 6 months. In the last 6 months, 47.95% of borrowers filled out a credit profile once (no previous inquiry), 44.82% from 2 to 5 times, and 7.23% more than 5 times.
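The percentages above come from grouping inquiry counts into brackets. The following Python sketch illustrates that grouping; it is not the R code used for the analysis. The function name `share_by_bin`, the bracket edges, and the toy data are hypothetical, and it assumes that zero recorded inquiries corresponds to a first-time profile.

```python
from collections import Counter

def share_by_bin(counts, upper_bounds, labels):
    """Return the percentage of observations in each labelled bracket.

    upper_bounds are inclusive upper limits for all brackets but the last,
    which is open-ended (so labels has one more element than upper_bounds).
    """
    tally = Counter()
    for c in counts:
        for upper, label in zip(upper_bounds, labels):
            if c <= upper:
                tally[label] += 1
                break
        else:
            # No bound matched: the value belongs to the open-ended bracket.
            tally[labels[-1]] += 1
    n = len(counts)
    return {label: round(100 * tally[label] / n, 2) for label in labels}

# Toy inquiry counts (hypothetical, not taken from the Prosper data).
total_inquiries = [0, 1, 3, 4, 5, 6, 9, 10, 12, 31]
upper_bounds = [0, 4, 9]  # assumption: 0 prior inquiries = first time
labels = ["1st time", "2nd to 5th", "6th to 10th", "more than 10th"]
shares = share_by_bin(total_inquiries, upper_bounds, labels)
print(shares)
```

The same function can be reused for the last-6-months counts by passing different bounds and labels.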

Bivariate Plots Section 1

Scatter plot of last 6 months vs. total inquiries. A strong positive correlation was observed between these two variables (r = 0.74), suggesting that people asking for loans many times in the last 6 months also asked many times in the previous years, and further suggesting that these loans were for business purposes rather than personal/family needs.
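For reference, the coefficient r reported here is Pearson's correlation coefficient. A minimal Python sketch of its computation follows; the function name and the toy values are hypothetical, and the analysis itself was carried out in R.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy values (hypothetical): inquiries in the last 6 months vs. total inquiries.
last_6_months = [0, 1, 2, 3, 5]
total = [2, 4, 7, 9, 20]
r = pearson_r(last_6_months, total)
print(round(r, 2))
```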

Boxplots of last 6 months vs. total inquiries. People who filled out credit profile forms 1 to 2 times in the last 6 months tended to have a total number of inquiries of 7 or fewer (probably for personal/family needs). On the other hand, those who filled out forms between 2 and 5 times in the last 6 months had a total number of inquiries between about 8 and 23 (possibly small businesses), while those who filled out forms between 5 and 15 times had a total between 24 and 59 (possibly long-term clients of Prosper and larger businesses).

Emerging research questions: For which purposes were people filling out credit profiles (listing category)? What is the distribution of the specific years in which the number of inquiries in the last 6 months was determined?


2.2. Listing category of borrowers

Each borrower selected one of the following categories when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7 - Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, and 20 - Wedding Loans.

These categories were re-arranged in a new variable (“AlternativeListingCategory”) with only 5 levels based on data information availability, socio–economic considerations, and purpose:
- category 0, when no information was available, that is for previous categories 0 - Not Available and 7 - Other
- category 1, termed as business and taxes, including previous categories 1 - Debt Consolidation, 3 - Business, 12 - Green Loans, and 18 - Taxes
- category 2, termed as personal loans and vehicles, including 4 - Personal Loan, 5 - Student Use, 6 - Auto, 9 - Boat, 16 - Motorcycle, and 17 - RV
- category 3, termed as family consolidation and household expenses, including 2 - Home Improvement, 8 - Baby&Adoption, 11 - Engagement Ring, 13 - Household Expenses, 14 - Large Purchases, 19 - Vacation, and 20 - Wedding Loans
- and finally, category 4 termed as medical intervention, including 15 - Medical/Dental and 10 - Cosmetic Procedure.
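The regrouping described in this list can be written as a simple lookup table. The Python sketch below is only an illustration of the mapping stated above; the names `GROUPS`, `LOOKUP`, and `alternative_listing_category` are hypothetical and do not come from the original R code.

```python
# Alternative category -> original "ListingCategory" codes, as listed above.
GROUPS = {
    0: [0, 7],                        # no information available
    1: [1, 3, 12, 18],                # business and taxes
    2: [4, 5, 6, 9, 16, 17],          # personal loans and vehicles
    3: [2, 8, 11, 13, 14, 19, 20],    # family consolidation and household expenses
    4: [10, 15],                      # medical intervention
}

# Invert the table: original code -> alternative category.
LOOKUP = {code: alt for alt, codes in GROUPS.items() for code in codes}

def alternative_listing_category(listing_category):
    """Map an original listing category code (0-20) to the 5-level variable."""
    return LOOKUP[listing_category]

print(alternative_listing_category(15))  # Medical/Dental
```

Building the lookup by inverting the grouped table keeps the definition close to the list above and makes it easy to check that each of the 21 original codes is assigned exactly once.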

Multivariate Plots Section 1

2.2.1. Distribution of number of inquiries per listing category

Main observations and comments. Histograms of inquiries faceted by alternative listing category revealed a long-tailed distribution in all cases. As expected, last 6 months and total inquiries showed the same tendencies. Overall, no information on the listing category was available in 33% of cases for total inquiries (respectively, 39.59% for last 6 months inquiries) (category 0). The proportion of listing categories was determined for the remaining data, and plotted.


2.2.2. Proportion of inquiries per listing category

Bar graphs of proportions per listing category. For each listing category, the percentages below are given for total inquiries (respectively, last 6 months inquiries), in the following order: borrowers filling out a profile once, then up to one fifth of the maximal number of inquiries, then up to 2/3 of the maximal number of inquiries, and finally beyond 2/3 of the maximal number of inquiries:
- business and taxes (category 1): 8.29% (vs. 50.62%), 62.31% (vs. 37.67%), 23.48% (vs. 9.75%), and 5.92% (vs. 1.96%);
- personal loans and vehicles (category 2): 6.91% (vs. 40.53%), 56.92% (vs. 39.38%), 24.5% (vs. 15%), and 11.66% (vs. 5.09%);
- family consolidation and household expenses (category 3): 6.68% (vs. 41.64%), 59.58% (vs. 41.14%), 26.23% (vs. 14.26%), and 7.52% (vs. 2.95%);
- medical interventions (category 4): 7.33% (vs. 44.88%), 65.03% (vs. 41.9%), 22.92% (vs. 11.73%), and 4.72% (vs. 14.9%).
A chi-squared test (2-sample test for equality of proportions with continuity correction) revealed a p-value < 0.001 in all cases.
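To illustrate the kind of test named above, here is a minimal Python sketch of a two-sample test for equality of proportions using the Yates continuity correction on a 2x2 table. It is a sketch under assumptions, not the original R call: the function name and the counts are hypothetical, and the p-value relies on the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

def two_proportion_test(x1, n1, x2, n2):
    """Chi-squared test (1 df, Yates correction) that x1/n1 equals x2/n2.

    Returns the chi-squared statistic and the two-sided p-value.
    """
    a, b = x1, n1 - x1          # group 1: successes, failures
    c, d = x2, n2 - x2          # group 2: successes, failures
    total = n1 + n2
    # Continuity correction, never allowed to push the difference below zero.
    diff = max(0.0, abs(a * d - b * c) - total / 2)
    chi2 = total * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: borrowers inquiring once, out of 1,000 in each distribution.
chi2, p_value = two_proportion_test(83, 1000, 506, 1000)
print(round(chi2, 2), p_value < 0.001)
```

With proportions as far apart as those reported for "filling once" (single digits vs. 40 to 50%), the statistic becomes very large even for moderate sample sizes, which is consistent with the reported p-values below 0.001.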

Comments. Although borrowers filling out from 2/3 of the total up to the maximal number of inquiries were less numerous in both distributions (6 or more vs. 4 or more times), the statistical populations of total inquiries and last 6 months inquiries were significantly different for all categories. Notably, the proportion of borrowers applying for a loan only once was very low in total inquiries (many years), while it was very high over a period of 6 months. Similarly, the proportion of people applying up to one fifth of the maximal number of times (2 to 6 vs. 2 or 3 times) was the highest in terms of total inquiries, but only the second most common in the last 6 months’ inquiry distribution. Considering that these tendencies were observed across all categories (with only slight differences), we hypothesize that the borrowers of Prosper tend to keep on borrowing money, while the number of new borrowers is very high. Therefore, the service provided by Prosper would satisfy the customers (explaining their loyalty) and attract new ones.

Disambiguation. Considering that our objective requires focusing on individuals and families, in the further steps of the study we will exclude people who applied for business and focus on people requiring loans for personal, familial or medical reasons. Unfortunately, the available data do not provide information on whether people who filled out credit profile forms before were funded or not. We will assume that people who applied for loans numerous times in a period of 6 months for individual or familial reasons could be people who were not funded at their first attempts and who are in desperate need of a loan.

Emerging research question. What did people who were funded at first attempt (respectively, those who filled out the form many times in the last 6 months) have in common?


3. Inquiries, income, home ownership and employment status duration

3.1. Home ownership


Main observations. No correlation was observed between the last 6 months inquiries and the employment status duration (r = -0.0016). However, the scatter plot of last 6 months inquiries as a function of employment status duration revealed that people with longer employment durations were funded after fewer attempts than people with short durations. Coloring the graph by home ownership revealed that people who were funded at the first or after few attempts also tended to be home owners. Faceting the graph by home ownership (1 owner, 0 not owner) and coloring by Prosper risk score confirmed these observations and pointed to a good Prosper score (close to 10) as another common feature among people who were funded. The Prosper score is a custom risk score built using historical Prosper data. The score ranges from 1 to 10, with 10 being the best, or lowest risk, score.

Emerging research question. Is there any correlation between the monthly income and the chances of being funded?


3.2. Monthly income



Main observations. No correlation was observed between the last 6 months inquiries and the monthly income (r = 0.09). However, the scatter plot of last 6 months inquiries as a function of the monthly income revealed that people with higher monthly income (above USD 15,000) were funded at the first attempt or after fewer attempts than people with lower income (in particular those with less than USD 5,000). Coloring the graph by home ownership confirmed the observation that most of the people who were funded at the first or after few attempts were home owners. Faceting the graph by home ownership (1 owner, 0 not owner) and coloring by Prosper risk score revealed that a bad Prosper score and an income lower than USD 5,000 were associated with more inquiries in a short time (6 months), probably due to rejections of previous attempts, with or without home ownership.

Emerging research question. What about the loan amount?



Main observations. A weak positive correlation was observed between the loan amount and the monthly income (r = 0.32). As expected, in the scatter plot of these variables the loan amount increased roughly with the monthly income. Coloring the graph by home ownership revealed that most people with a monthly income of USD 5,000 or less were funded only for up to USD 15,000. Conversely, larger amounts were mainly provided to people with incomes higher than USD 5,000. Faceting the graph by home ownership (1 owner, 0 not owner) and coloring by Prosper risk score revealed that most borrowers were home owners and that some people with high incomes had a poor Prosper score (higher risk). Overall, home ownership did not affect the tendency to provide higher amounts mainly to people with higher incomes.


Univariate Plots Section 2

4. Loan amount, risk and gain

4.1. Delinquency

Delinquency was evaluated as the number of days delinquent (days past the payment due date).


Risk score. The histogram revealed a normal-like distribution, where most borrowers had a risk score between 4 and 8 (median = 6).

Delinquency. Only 15.79% of the borrowers were reported delinquent, as illustrated by the split line graphs, which present ordered delinquency days. The histogram of delinquent borrowers revealed a bimodal distribution: a positively skewed distribution whose mode was roughly two times higher than the mode of the other (negatively skewed) distribution. The bin width of the histogram is 60 days.
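As an illustration of the 60-day binning used for the delinquency histogram, the following Python sketch counts delinquent borrowers per 60-day bin and computes the share of delinquent borrowers. The function name and the toy values are hypothetical, and it assumes that a borrower is delinquent when the number of days delinquent is greater than zero.

```python
from collections import Counter

BIN_WIDTH = 60  # days, as stated for the histogram

def bin_days(days_delinquent, bin_width=BIN_WIDTH):
    """Count delinquent borrowers per bin, keyed by the bin's lower bound."""
    counts = Counter()
    for days in days_delinquent:
        if days > 0:  # assumption: 0 days means the borrower is not delinquent
            counts[(days // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Toy values (hypothetical) for "LoanCurrentDaysDelinquent".
days = [0, 0, 0, 12, 45, 61, 119, 120, 400, 2000]
histogram = bin_days(days)
share_delinquent = 100 * sum(histogram.values()) / len(days)
print(histogram, share_delinquent)
```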

Emerging research question. What could be the impact of delinquency?


Multivariate Plots Section 2

4.2. Score change and loan status

4.2.1. Loan status


Main observations. The majority of people (beyond 98%) either completed their loans or were regular in their payments (current). Delinquency (money due in the graph) was rare.

Emerging research question. What could be the impact of delinquency? Do lenders still have a positive return on investment?


Univariate Plots Section 3

4.2.2. Score change at time of listing and lender yield

The borrower’s credit score change at the time the credit profile was pulled is the change relative to the borrower’s last Prosper loan. This value was null if the borrower had no prior loans. The lender yield on the loan, in turn, is the interest rate on the loan less the servicing fee.

Score change at time of listing. The histogram revealed an almost zero-centered (median = -3, q25 = -35, q75 = 25) normal distribution. The line graph revealed an inflection point at zero (as 6% of people had no change). More people had a negative score change (51.92%), while 42.12% had a positive change.

Lender yield. The histogram revealed a normal-like distribution (median = 0.173, q25 = 0.124, q75 = 0.24).
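The summaries quoted above (median, q25, q75, and the shares of negative, zero, and positive score changes) can be obtained as in the following Python sketch. The function names and toy values are hypothetical; the "inclusive" quantile method is an assumption chosen because it interpolates linearly between observed values, and other quantile definitions can give slightly different results.

```python
import statistics

def quartile_summary(values):
    """Return the first quartile, median, and third quartile of the values."""
    q25, median, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q25": q25, "median": median, "q75": q75}

def sign_shares(values):
    """Percentage of negative, zero, and positive values."""
    n = len(values)
    return {
        "negative": 100 * sum(v < 0 for v in values) / n,
        "zero": 100 * sum(v == 0 for v in values) / n,
        "positive": 100 * sum(v > 0 for v in values) / n,
    }

# Toy score changes (hypothetical), not taken from the Prosper data.
score_changes = [-40, -35, -10, -3, 0, 12, 25, 30]
print(quartile_summary(score_changes))
print(sign_shares(score_changes))
```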


Loan amount. As illustrated in the line graph of increasing values (longer steps), 45.48% of borrowers asked for USD 5,000 or less; 27.33% between USD 5,000 and 10,000; 17.58% between USD 10,000 and 15,000; and 9.61% more than USD 15,000. Peaks in borrower frequency were observed around USD 4,000; 5,000; 10,000; and 15,000.

Emerging research question. Is there any link between the stated monthly income and the loan amount and term?


Multivariate Plots Section 3

4.2.3. Score change at time of listing and lender yield



Term and percent funded. Most loans had a term (length of the loan) of 36 months (77.04%), while for 21.54% of borrowers the loan term was 60 months. The listing was fully funded for almost all borrowers (99.24%).

Stated income, loan amount and term. Coloring the scatter plot of stated income vs. loan original amount (which was also the final loan amount, given that almost all loans were funded at 100%) revealed that the term increased with both the loan amount and the monthly income. Faceting the graph by term and coloring by Prosper score confirmed that people with higher risk (low Prosper score) were funded only for loans under USD 10,000, regardless of their monthly income. People with the highest incomes also had the best Prosper scores, and almost all were granted loans beyond USD 15,000, for a term of 36 or 60 months.

Emerging research question. Was the monthly income stated always verifiable (or verified)?


4.2.4. Income verification and Prosper score


Main observations. The income was verifiable in most cases. Non-verifiable income amounts were associated with a low Prosper score (higher risk). Relatively high loan amounts (more than USD 15,000) were not common in this category.

Emerging research question. Were these measures (granting high-amount and long-term loans only to people with a high Prosper score, a high monthly income, and verified income…) efficient enough to preserve lenders’ return on investment and to reduce or prevent delinquency?


4.3. Delinquency and lender yield



Lender yield vs. Prosper score. Lenders’ gain ranged from 0.05 to about 0.35. The gain increased with the risk (low Prosper score), despite the presence of most delinquent borrowers in the high-risk category.

Loan amounts vs. Prosper score. As observed before, in most cases a higher Prosper score (less risk) was associated with loan funding. Delinquent borrowers were distributed evenly across all loan amounts.

Lender yield vs. loan amount. Lenders gained the most on loans lower than USD 10,000. However, delinquent borrowers were also common in that loan amount category.



Final Plots and Summary

After listing and exploring the variable names, only the variables judged of interest for the study were extracted from the data frame of 81 variables obtained from the Prosper CSV file (to improve memory use and analysis speed). Data were explored in terms of type (integer, boolean…) and quality (sufficient non-missing data). Where necessary, new variables were created by transforming pre-existing ones (using functions and iterations). The data were then processed for univariate, bivariate and multivariate plotting. Histograms, scatter plots, bar graphs, and line graphs were used, faceted and colored by categorical variables like home ownership, evidence of stated income, loan term, and Prosper score. Various methods were used to improve the quality of graphs and the detection of the information contained in the data, including:
- decreasing overplotting (using alpha factor, jitters…);
- transforming axes using the sqrt function in a lossless fashion (using the coord_cartesian function);
- removing extreme values (e.g. by plotting data between quantiles 0.1-0.5 and 0.95-0.99);
- creating new categorical variables from continuous variables or from categorical variables with either more general information (e.g. extraction of the delinquency status from the more general categorical variable “LoanStatus”) or numerous non-organized values (such as the categorical variable “ListingCategory”, which had more than 20 possible values).
As appropriate, proportions and means were determined for a more accurate description of the data.
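As an illustration of the removal of extreme values by quantiles, the Python sketch below keeps only the observations lying between a lower and an upper quantile. It is a sketch under assumptions rather than the original R code: the function names, the default cut-offs, the linear-interpolation quantile definition, and the toy incomes are all hypothetical.

```python
def quantile(sorted_values, p):
    """Quantile of already sorted values, by linear interpolation (0 <= p <= 1)."""
    position = (len(sorted_values) - 1) * p
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

def trim(values, lower=0.01, upper=0.99):
    """Keep the values lying between the lower and upper quantiles (inclusive)."""
    ordered = sorted(values)
    low_cut = quantile(ordered, lower)
    high_cut = quantile(ordered, upper)
    return [v for v in values if low_cut <= v <= high_cut]

# Toy monthly incomes (hypothetical) with one extreme value.
incomes = list(range(1000, 11000, 1000)) + [250000]
trimmed = trim(incomes, lower=0.0, upper=0.95)
print(trimmed)
```

Trimming only affects what is plotted or summarized; the original values remain available in the untouched list.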

Plot One

Description One

The boxplot of inquiries in the last 6 months as a function of Prosper score, colored by income range (upper), suggested that people with a high Prosper score (less risk) were more likely to be funded, including those with very high incomes (> USD 100,000). The combination of that score and a high monthly income appeared to be associated with funding at first attempt. The boxplot of loan amount as a function of Prosper score, colored by income range (lower), suggested that people with both a high Prosper score and a high income were more likely to be funded for high amounts, while people with a low Prosper score were not funded for more than about USD 5,000. These observations summarize the previous graphs well and call for a quantitative analysis of the impact of the income range, loan amount and Prosper score (which appeared as the strongest determinant for a loan in our study).

Plot Two

Description Two

The average monthly income as a function of Prosper score, faceted by the work status duration and colored by the number of inquiries in the last 6 months, further suggested that people with a low Prosper score were not funded for more than USD 5,000. More specifically:
(i) for people with high Prosper score who had:
- 3 years or less work status duration: those who were funded at first inquiry had an average monthly income (± SEM) of USD 5,901 ± 107 against USD 8,373 ± 1,045 for those who had to ask more times (t-test, p-value = 0.0247);
- 4 to 10 years work status duration: USD 6,614 ± 119 against USD 8,979 ± 895 (t-test, p-value = 0.0115);
- more than 11 years: USD 7,216 ± 136 against USD 10,091 ± 1,203 (t-test, p-value = 0.0231);

(ii) for people with low Prosper score who had:
- 3 years or less work status duration: USD 4,576 ± 125 against USD 5,507 ± 309 (t-test, p-value = 0.0059);
- 4 to 10 years work status duration: USD 4,951 ± 109 against USD 6,251 ± 326 (t-test, p-value = 0.0002);
- more than 11 years: USD 5,862 ± 158 against USD 6,745 ± 287 (t-test, p-value = 0.0076).

Surprisingly, it appears that most people not funded at first attempt had a high Prosper score and a good income. Analysing the average loan amount provided a possible explanation, as (iii) people with high Prosper score who had:
- 3 years or less work status duration asked for USD 8,318 ± 171 against USD 11,564 ± 1,457 for those who had to ask more times (t-test, p-value = 0.0516);
- 4 to 10 years work status duration: USD 8,235 ± 145 against USD 11,544 ± 1,061 (t-test, p-value = 0.00308);
- more than 11 years: USD 8,656 ± 170 against USD 10,302 ± 1,304 (t-test, p-value = 0.2178).
Therefore, people with a high Prosper score, who also had a high income, were probably not funded at the first attempt because they were asking for high amounts.

(iv) No significant difference in loan amounts was observed between people with low scores inquiring once (USD 5,162 ± 71) or many times (USD 5,251 ± 165, p-value = 0.5982), confirming the graphical observation that people with a low Prosper score were not funded for more than USD 5,000, regardless of their monthly income or work status duration.
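The "mean ± SEM" values and t-tests reported above can be related as in the following Python sketch. It is a sketch under assumptions: the function names and the toy sample are hypothetical, and the unequal-variance (Welch) form of the t statistic is assumed, since the variant actually used is not stated. Only the t statistic is computed, because the Python standard library has no Student t distribution from which to derive a p-value.

```python
import math
import statistics

def mean_sem(values):
    """Return the mean and the standard error of the mean (sample sd / sqrt(n))."""
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values), sem

def welch_t(mean_1, sem_1, mean_2, sem_2):
    """Welch t statistic computed from two means and their standard errors."""
    return (mean_2 - mean_1) / math.sqrt(sem_1 ** 2 + sem_2 ** 2)

# Toy sample (hypothetical) to show the mean and SEM computation.
m, sem = mean_sem([2, 4, 4, 4, 5, 5, 7, 9])

# Reported values for high Prosper score, 3 years or less work status duration:
# USD 5,901 ± 107 (funded at first inquiry) vs. USD 8,373 ± 1,045 (asked more times).
t = welch_t(5901, 107, 8373, 1045)
print(round(m, 2), round(sem, 4), round(t, 3))
```

Because the SEM shrinks with the square root of the sample size, the much larger SEM of the "asked more times" groups suggests that these groups were considerably smaller or more variable than the "funded at first inquiry" groups.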

Plot Three

Description Three

Quantitative analysis invalidated our assumption that delinquent borrowers were common in the high-risk category (only 8 of 509 = 1.57%). They appeared instead to be common in the central interval of the normal-like distribution of Prosper scores (368 of 509 = 72.3% between Prosper scores 3 and 7). On the other hand, quantitative analysis confirmed our hypothesis from qualitative observation that lenders’ gain increased with decreasing Prosper score (negative correlation). The average lender yield for Prosper score 1 was the highest (0.291) and the yield for Prosper score 10 was the lowest (0.198, p-value against score 1 < 2.2e-16).
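The two shares quoted above follow directly from the reported counts, as this short Python check shows (the function name `percent` is hypothetical):

```python
def percent(part, whole):
    """Share of part in whole, as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

# Counts reported above: 509 delinquent borrowers in total.
high_risk_share = percent(8, 509)    # delinquent borrowers in the high-risk category
central_share = percent(368, 509)    # delinquent borrowers with Prosper scores 3 to 7
print(high_risk_share, central_share)
```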


Reflection

Total inquiries’ and the last 6 months inquiries’ statistical populations were significantly different in all listing categories. People applying for loans many times in the last 6 months also applied many times in the previous years, probably for business purposes, I hypothesized. The last 6 months’ distribution had more new customers than returning customers, possibly indicating that at the time of data pulling Prosper was attracting new borrowers. On the other hand, the proportion of people applying up to one fifth of the maximal number of times (2 or 3 times in 6 months and 2 to 6 in total) was the highest in terms of total inquiries, but only the second most common in the last 6 months’ inquiry distribution, suggesting that the borrowers of Prosper tended to stay and keep on borrowing money. A limitation of the current dataset for our study is the unavailability of the reasons why a client (on the basis of a unique identifier) borrowed money each time he/she came (these could differ each time, e.g. personal, medical, business…). It could be interesting for future studies in this direction to retrieve data about the listing category selected by borrowers all the previous times they applied for a loan, and whether they got funded or not. Then, the real proportion of people funded in each category, and after how many attempts, could be accurately determined.
Nonetheless, in order to analyze loans for individual and familial motivations, we excluded people who applied for business at the time the credit profile was pulled. Although home ownership, monthly income, and the employment status duration correlated roughly with the loan funding, term and amount, the Prosper risk score appeared to have a far stronger correlation. Quantitative analysis confirmed my qualitative observations. In addition, it confirmed that delinquency was relatively scarce, but more common among people with a Prosper score between 3 and 7, as the Prosper score followed a normal-like distribution. Lenders had a positive return on investment despite delinquency, particularly those investing in borrowers with higher risk (low Prosper score). From these results, it appears to me that Prosper succeeded in its objective of providing a more humane approach to banking than classical financial institutions, as customers seem to remain loyal and new customers are attracted. It is also a successful business, as lenders have a return on investment between 19% and 30%, which is far better than the typical interest produced via banks. Future studies may address how the Prosper score is determined and how it could be improved, so that borrowers providing better yield and less delinquency (currently mainly those with low scores) receive better scores, and the category with the most delinquent borrowers receives a lower score (indicating more risk).

